Automated identification of borrowings in multilingual wordlists
Abstract
Although lexical borrowing is an important aspect of language evolution, there have been few attempts to automate the identification of borrowings in lexical datasets. Moreover, none of the solutions which have been proposed so far identify borrowings across multiple languages. This study proposes a new method for the task and tests it on a newly compiled large comparative dataset of 48 South-East Asian languages from Southern China. The method yields very promising results, while it is conceptually straightforward and easy to apply. This makes the approach a perfect candidate for computer-assisted exploratory studies on lexical borrowing in contact areas.
Similar resources
LexStat: Automatic Detection of Cognates in Multilingual Wordlists
In this paper, a new method for automatic cognate detection in multilingual wordlists will be presented. The main idea behind the method is to combine different approaches to sequence comparison in historical linguistics and evolutionary biology into a new framework which closely models the most important aspects of the comparative method. The method is implemented as a Python program and provi...
Automated Alignment in Multilingual Corpora
Experiences in computing alignments at the paragraph and sentence level within a project TRANSLEARN in the European Union's "LRE" programme of research and development in language engineering are reported. About 98% of the sentences in pairs of corpora in different languages have been aligned correctly by a method that uses dynamic programming on numbers of characters per sentence. This paralle...
Using Sequence Similarity Networks to Identify Partial Cognates in Multilingual Wordlists
Increasing amounts of digital data in historical linguistics necessitate the development of automatic methods for the detection of cognate words across languages. Recently developed methods work well on language families with moderate time depths, but they are not capable of identifying cognate morphemes in words which are only partially related. Partial cognacy, however, is a frequently recurr...
Language Identification in Multilingual Documents
Most optical character recognition (OCR) systems can recognize at most a few languages. For large archives of document images that contain different languages, there must be some way to automatically categorize these documents before applying the proper OCR on them. This report presents a research in the identification of English, Chinese, Malay and Tamil in image documents. While most other wo...
Improving Automated Alignment in Multilingual Corpora
We report on methods of improving multilingual text alignments that have been produced in a simple dynamic-programming scheme, by automated detection of possible misalignments. Details of methods involving cognates, specially identified words, and propositional contents of sentences are given, together with notable features of their performance on parallel corpora in a number of different types ...
Journal
Journal title: Open Research Europe
Year: 2022
ISSN: 2732-5121
DOI: https://doi.org/10.12688/openreseurope.13843.3